Introduction to PUMS

The Public Use Microdata Sample (PUMS) dataset

About the data

  • Collected by the United States Census Bureau as part of the American Community Survey
  • Disclosure protection: noise is introduced so that specific people or households cannot be identified
  • Covers 2005–2022 using the 1-year estimates (excluding 2020, because of COVID)
  • Split into person and household
    • columns: person: 230, household: 188
    • rows: person: 53M, household: 25M

A few example variables

  • Person
    • Language spoken at home (LANP)
    • Travel time to work (JWMNP)
  • Household
    • Access to internet (ACCESS)
    • Monthly rent (RNTP)
  • Weights 😵‍💫
    • PWGTP (person) and WGTP (household) for weights
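
The weights matter because each record in the sample stands in for more than one person or household. A minimal sketch of a weighted estimate follows. The records are invented for illustration (they are not real PUMS rows), and `weighted_mean` is a hypothetical helper:

```python
# Invented example records: travel time to work in minutes (JWMNP)
# and the person weight (PWGTP). These are NOT real PUMS rows.
records = [
    {"JWMNP": 10, "PWGTP": 50},
    {"JWMNP": 30, "PWGTP": 100},
    {"JWMNP": 60, "PWGTP": 50},
]


def weighted_mean(rows, value_col, weight_col):
    """Weighted mean of value_col, weighting each row by weight_col."""
    total_weight = sum(r[weight_col] for r in rows)
    weighted_sum = sum(r[value_col] * r[weight_col] for r in rows)
    return weighted_sum / total_weight


# Treats every sampled record as one person.
unweighted = sum(r["JWMNP"] for r in records) / len(records)

# Treats every sampled record as PWGTP people.
weighted = weighted_mean(records, "JWMNP", "PWGTP")

# Summing the weights estimates how many people the records represent.
estimated_people = sum(r["PWGTP"] for r in records)
```

The two means differ because the middle record counts for twice as many people as the others.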

Format of the data

  • Released and available as CSV files (~90GB)
  • Uses survey-style coding
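
Survey-style coding means that categorical answers are stored as short codes that have to be looked up in the data dictionary. A rough sketch of what recoding involves is below. The code-to-label mapping is an assumption written for illustration (check the official data dictionary for the year in question), and `recode` is a hypothetical helper:

```python
# Illustrative mapping for ACCESS. This is an assumption, not copied from
# the official data dictionary; confirm the codes before relying on them.
ACCESS_LABELS = {
    "1": "Yes, with a subscription",
    "2": "Yes, without a subscription",
    "3": "No internet access",
}


def recode(value, labels):
    """Return the label for a coded value, or None if missing or unknown."""
    if value is None or value == "":
        return None
    return labels.get(str(value).strip())


raw = ["1", "3", "", "2"]  # invented raw values, as they might appear in a CSV
decoded = [recode(v, ACCESS_LABELS) for v in raw]
```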

For this workshop:

  • Recoded the dataset
  • Saved as parquet (~12GB) partitioned by year and state
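
Partitioning by year and state means a question about one state in one year only needs to touch a small set of files. The sketch below shows that idea using path names alone. It assumes a Hive-style layout such as `year=2019/state=WA/part-0.parquet`; the layout, the file names, and the `select_files` helper are assumptions, not the workshop's actual directory structure:

```python
from pathlib import PurePosixPath

# Hypothetical file listing in an assumed Hive-style layout.
files = [
    "person/year=2018/state=WA/part-0.parquet",
    "person/year=2019/state=WA/part-0.parquet",
    "person/year=2019/state=OR/part-0.parquet",
]


def partition_keys(path):
    """Parse key=value directory names into a dict of strings."""
    parts = PurePosixPath(path).parts[:-1]  # drop the file name
    return dict(p.split("=", 1) for p in parts if "=" in p)


def select_files(paths, **wanted):
    """Keep only files whose partition keys match every wanted value."""
    return [
        p
        for p in paths
        if all(partition_keys(p).get(k) == str(v) for k, v in wanted.items())
    ]


# Only one of the three files needs to be read for Washington in 2019.
wa_2019 = select_files(files, year=2019, state="WA")
```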

Can I analyze all of PUMS?

Most analyses of PUMS data start by subsetting the data: by state (or an even smaller area), by year, and often by both.

But with the tools we learn about in this workshop, we actually can analyze the whole dataset.
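
One way to see why this is feasible: many summaries can be built up one piece at a time, so the full table never has to sit in memory at once. A minimal sketch with invented numbers is below. `chunks` is a hypothetical stand-in for reading one year/state partition at a time, not a function from any particular library:

```python
def chunks():
    """Stand-in for reading partitions one at a time (invented values)."""
    yield [(10, 50), (30, 100)]  # (JWMNP, PWGTP) pairs from one partition
    yield [(60, 50)]             # pairs from another partition


# Running totals are all that is kept between partitions.
weighted_sum = 0
total_weight = 0
for chunk in chunks():
    for minutes, weight in chunk:
        weighted_sum += minutes * weight
        total_weight += weight

# Weighted mean travel time across every partition seen.
overall_mean = weighted_sum / total_weight
```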

What can I do?

Caveat

Though we have not purposefully altered this data, it should not be relied on as a perfect, or even necessarily accurate, representation of the official PUMS dataset.